Investigate a Dataset on Medical Appointment No Shows¶

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

Questions¶

  • What causes patients not to show up for their appointments?
  • What is the relationship between their disease and not showing up?
  • ScheduledDay tells us on what day the patient set up their appointment.
  • Neighbourhood indicates the location of the hospital.
  • Scholarship indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.

Import libraries to be used¶

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sas
import plotly.express as px

Data Wrangling¶

In [10]:
df= pd.read_csv("dataset/appointments.csv")
df.head()
Out[10]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [11]:
df.shape
Out[11]:
(110527, 14)
In [12]:
df.columns
Out[12]:
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
      dtype='object')
In [13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
  • ScheduledDay is stored as object (string), not datetime
  • AppointmentDay is stored as object (string), not datetime
  • No-show is stored as object (string)
  • There are no missing values
In [14]:
df.isnull().sum().sum()
Out[14]:
0
  • No null values
In [15]:
df.describe().transpose()
Out[15]:
count mean std min 25% 50% 75% max
PatientId 110527.0 1.474963e+14 2.560949e+14 3.921784e+04 4.172614e+12 3.173184e+13 9.439172e+13 9.999816e+14
AppointmentID 110527.0 5.675305e+06 7.129575e+04 5.030230e+06 5.640286e+06 5.680573e+06 5.725524e+06 5.790484e+06
Age 110527.0 3.708887e+01 2.311020e+01 -1.000000e+00 1.800000e+01 3.700000e+01 5.500000e+01 1.150000e+02
Scholarship 110527.0 9.826558e-02 2.976748e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
Hipertension 110527.0 1.972459e-01 3.979213e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
Diabetes 110527.0 7.186479e-02 2.582651e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
Alcoholism 110527.0 3.039981e-02 1.716856e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
Handcap 110527.0 2.224796e-02 1.615427e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00
SMS_received 110527.0 3.210256e-01 4.668727e-01 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00
In [16]:
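# Correlate numeric columns only; newer pandas versions raise an error on string columns
df_corr=df.select_dtypes('number').corr()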
df_corr.style.background_gradient(cmap='coolwarm', axis=None)
Out[16]:
  PatientId AppointmentID Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received
PatientId 1.000000 0.004039 -0.004139 -0.002880 -0.006441 0.001605 0.011011 -0.007916 -0.009749
AppointmentID 0.004039 1.000000 -0.019126 0.022615 0.012752 0.022628 0.032944 0.014106 -0.256618
Age -0.004139 -0.019126 1.000000 -0.092457 0.504586 0.292391 0.095811 0.078033 0.012643
Scholarship -0.002880 0.022615 -0.092457 1.000000 -0.019729 -0.024894 0.035022 -0.008586 0.001194
Hipertension -0.006441 0.012752 0.504586 -0.019729 1.000000 0.433086 0.087971 0.080083 -0.006267
Diabetes 0.001605 0.022628 0.292391 -0.024894 0.433086 1.000000 0.018474 0.057530 -0.014550
Alcoholism 0.011011 0.032944 0.095811 0.035022 0.087971 0.018474 1.000000 0.004648 -0.026147
Handcap -0.007916 0.014106 0.078033 -0.008586 0.080083 0.057530 0.004648 1.000000 -0.024161
SMS_received -0.009749 -0.256618 0.012643 0.001194 -0.006267 -0.014550 -0.026147 -0.024161 1.000000
In [17]:
df.nunique()
Out[17]:
PatientId          62299
AppointmentID     110527
Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hipertension           2
Diabetes               2
Alcoholism             2
Handcap                5
SMS_received           2
No-show                2
dtype: int64
  • There are more unique AppointmentID values (110527) than unique PatientId values (62299), which means many patients have more than one appointment
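
The repeat-visit observation above can be checked directly by counting appointments per patient. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are illustrative assumptions, not rows from the real file); on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Toy data standing in for the real file (values are made up for illustration).
toy = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3],
    "AppointmentID": [10, 11, 12, 13, 14, 15],
})

# Number of distinct appointments booked by each patient.
appointments_per_patient = toy.groupby("PatientId")["AppointmentID"].nunique()

# Patients who appear more than once.
repeat_patients = appointments_per_patient[appointments_per_patient > 1]

print(appointments_per_patient.to_dict())
print(len(repeat_patients), "of", toy["PatientId"].nunique(),
      "patients have more than one appointment")
```

Because the same patient can appear in several rows, rates computed per row describe appointments rather than individual people.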

Age Analysis¶

In [18]:
px.box(df,y='Age',title='Outliers in the Age column')
In [19]:
f,ax=plt.subplots(figsize=(8,8))
df_corr=df.select_dtypes('number').corr()
sas.heatmap(df_corr,annot=True)
Out[19]:
<AxesSubplot:>
  • The correlation between Age and Hipertension (high blood pressure) is about 0.5
In [20]:
df.hist(figsize=(10,10));
In [21]:
sas.pairplot(df,diag_kind='kde')
Out[21]:
<seaborn.axisgrid.PairGrid at 0x1536c2be0>
In [22]:
df['Age'].min()
Out[22]:
-1
In [23]:
df.loc[df['Age']<0]
Out[23]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
99832 4.659432e+14 5775010 F 2016-06-06T08:58:13Z 2016-06-06T00:00:00Z -1 ROMÃO 0 0 0 0 0 0 No
In [24]:
sas.set()
f,ax=plt.subplots()
ax.hist(df['Age'])
plt.title('Distribution of the Age column',fontsize=20)
plt.ylabel('Number of patients',fontsize=12)
plt.xlabel('Ages')
plt.show();
  • The average age is about 37
  • The oldest patient is 115
  • The share of appointments with a handicap recorded is small (the Handcap mean is about 0.02)
  • Only about 32% of appointments received an SMS reminder
  • The share of appointments involving alcoholism is small (about 3%)
  • The share involving chronic diseases is also small (about 20% for hypertension and 7% for diabetes)
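
The percentages in the list above follow from the fact that the mean of a 0/1 column is the share of rows equal to 1. Here is a small sketch on an invented toy frame (the values are assumptions for illustration only). Note that Handcap ranges from 0 to 4 in this dataset, so its mean is not a share; something like `(df['Handcap'] > 0).mean()` would be needed instead.

```python
import pandas as pd

# Toy 0/1 flags; the values are invented for illustration.
toy = pd.DataFrame({
    "Hipertension": [1, 0, 0, 0, 1],
    "Diabetes":     [0, 0, 0, 0, 1],
    "Alcoholism":   [0, 0, 0, 0, 0],
    "SMS_received": [1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows where the flag is set.
share_pct = (toy.mean() * 100).round(1)
print(share_pct)
```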

Data Cleaning¶

In [25]:
df.rename(columns=lambda x: x.lower().replace('-','_'),inplace=True)
df.head()
Out[25]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [26]:
# no_show == 'No' means the patient attended; 'Yes' means the appointment was missed
show=df.no_show=='No'
not_show=df.no_show=='Yes'
  • Convert scheduledday and appointmentday to datetime
  • Extract the date part into scheduledday_date and appointmentday_date
  • Calculate waiting_time as the number of days between the two dates
In [27]:
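# Parse the timestamp strings into datetime values
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['appointmentday']=pd.to_datetime(df['appointmentday'])

# Keep only the calendar date, since appointmentday carries no time of day
df['scheduledday_date']=df['scheduledday'].dt.date
df['appointmentday_date']=df['appointmentday'].dt.date

# Waiting time in whole days between booking and appointment
df['waiting_time']=(df['appointmentday_date']-df['scheduledday_date']).dt.days
df['waiting_time']=df['waiting_time'].astype(int)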

df.head()
Out[27]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show scheduledday_date appointmentday_date waiting_time
0 2.987250e+13 5642903 F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00 62 JARDIM DA PENHA 0 1 0 0 0 0 No 2016-04-29 2016-04-29 0
1 5.589978e+14 5642503 M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0 0 0 0 0 0 No 2016-04-29 2016-04-29 0
2 4.262962e+12 5642549 F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00 62 MATA DA PRAIA 0 0 0 0 0 0 No 2016-04-29 2016-04-29 0
3 8.679512e+11 5642828 F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No 2016-04-29 2016-04-29 0
4 8.841186e+12 5642494 F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0 1 1 0 0 0 No 2016-04-29 2016-04-29 0
In [28]:
df.drop(df.loc[df['age']<0].index,axis=0,inplace=True)

df.loc[df['age']<0]
Out[28]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show scheduledday_date appointmentday_date waiting_time

Analysis¶

In [29]:
df['no_show']=df['no_show'].astype('category')
df['no_show']=df['no_show'].cat.codes
print(df['no_show'].dtypes)
int8
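
The encoding step above relies on `cat.codes` numbering the categories in sorted order, so 'No' becomes 0 (the patient attended) and 'Yes' becomes 1 (the appointment was missed). As a sketch on a hand-written toy series (the values are assumptions), an explicit mapping produces the same codes and makes the meaning visible in the code:

```python
import pandas as pd

# Toy labels in the same format as the original No-show column.
no_show_raw = pd.Series(["No", "Yes", "No", "No"], name="no_show")

# cat.codes numbers the categories in sorted order: 'No' -> 0, 'Yes' -> 1.
codes = no_show_raw.astype("category").cat.codes

# An explicit mapping gives the same values and documents their meaning.
explicit = no_show_raw.map({"No": 0, "Yes": 1})

print(codes.tolist())     # [0, 1, 0, 0]
print(explicit.tolist())  # [0, 1, 0, 0]
```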
In [30]:
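# Correlation heatmap over the numeric columns, now including no_show and waiting_time
f, ax = plt.subplots(figsize=(10,10))
sas.heatmap(df.select_dtypes('number').corr(),annot=True);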
In [32]:
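# Count plot of one 0/1 column, split by no_show (0 = showed up, 1 = missed)

def plotMygraph(position, dataPoint, title):
    plt.subplot(3,2,position)
    # Values other than 0 and 1 (handcap ranges up to 4) map to NaN and are left out
    labels=dataPoint.map({1:'yes',0:'no'})
    sas.countplot(x=labels,data=df,hue='no_show')
    plt.title(title,fontsize=15)
    plt.legend(title='no_show',labels=['no','yes'])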

    

#hipertension
plotMygraph(1, df['hipertension'], 'Effect of hipertension on attendance')
#diabetes
plotMygraph(2, df['diabetes'], 'Effect of diabetes on attendance')
#handcap
plotMygraph(3, df['handcap'], 'Effect of handcap on attendance')
#alcoholism
plotMygraph(4, df['alcoholism'], 'Effect of alcoholism on attendance')
#sms_received
plotMygraph(5, df['sms_received'], 'Effect of SMS delivery on attendance')
#scholarship
plotMygraph(6, df['scholarship'], 'Effect of scholarship on attendance')



plt.subplots_adjust(left=0,right=1.5,bottom=0,top=2.5,wspace=0.3,hspace=0.3)
/Users/itailouiszulu/tensorflow-test/env/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.


Conclusions¶

  • Clinic location
    • Attendance rates vary from one neighbourhood to another, which suggests that clinic location affects patient attendance.
  • Gender
    • Females miss appointments more often than males.
  • Diseases
    • They do not clearly affect attendance rates.
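
The notebook does not show the code behind these conclusions. One way to check them, sketched here under the assumption that no_show is coded 0/1 as in the Analysis section, is to compare no-show rates per group rather than raw counts. The `toy` frame and the helper `no_show_rate` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy data; values are invented. no_show: 1 = missed, 0 = attended.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "neighbourhood": ["A", "A", "B", "B", "A", "B"],
    "no_show": [1, 0, 0, 0, 1, 0],
})

def no_show_rate(frame, column):
    """Share of missed appointments for each value of `column`."""
    return frame.groupby(column)["no_show"].mean()

rate_by_gender = no_show_rate(toy, "gender")
rate_by_neighbourhood = no_show_rate(toy, "neighbourhood")

print(rate_by_gender)
print(rate_by_neighbourhood)
```

Rates matter here because one group can have more missed appointments simply by having more appointments overall. In this toy example females and males each miss one appointment, yet the male rate is twice the female rate.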

Limitations¶

  • The Age column has outliers
  • The number of females is greater than the number of males
  • Patients are unevenly distributed across age groups
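
To make the outlier point concrete, the common 1.5 × IQR rule can be applied using the Age quartiles reported by `df.describe()` earlier (25% = 18, 75% = 55). This is only a sketch: the IQR rule is one convention among several, and the `ages` sample below is a handful of values seen in the notebook output, not the full column.

```python
import pandas as pd

# Quartiles of Age as reported by df.describe() earlier in the notebook.
q1, q3 = 18.0, 55.0
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A few ages that appear in the notebook output, used only as a toy sample.
ages = pd.Series([-1, 8, 56, 62, 115], name="age")

# Values outside the IQR fences are statistical outliers.
statistical_outliers = ages[(ages < lower) | (ages > upper)]

# Negative ages are invalid regardless of the fences.
invalid = ages[ages < 0]

print(lower, upper)                   # -37.5 110.5
print(statistical_outliers.tolist())  # [115]
print(invalid.tolist())               # [-1]
```

Under these fences the age of 115 is flagged while -1 is not, which is why the negative age has to be handled by a separate validity check, as the Data Cleaning section does.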